{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# LightGBM"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## What is LightGBM\n",
    "\n",
    "[LightGBM](https://github.com/Microsoft/LightGBM) is an open-source,\n",
    "distributed, high-performance gradient boosting (GBDT, GBRT, GBM, or\n",
    "MART) framework. This framework specializes in creating high-quality and\n",
    "GPU-enabled decision tree algorithms for ranking, classification, and\n",
    "many other machine learning tasks. LightGBM is part of Microsoft's\n",
    "[DMTK](https://github.com/microsoft/dmtk) project.\n",
    "\n",
    "### Advantages of LightGBM\n",
    "\n",
    "-   **Composability**: LightGBM models can be incorporated into existing\n",
    "    SparkML pipelines and used for batch, streaming, and serving\n",
    "    workloads.\n",
    "-   **Performance**: LightGBM on Spark is 10-30% faster than SparkML on\n",
    "    the [Higgs dataset](https://archive.ics.uci.edu/dataset/280/higgs) and achieves a 15% increase in AUC.  [Parallel\n",
    "    experiments](https://github.com/Microsoft/LightGBM/blob/master/docs/Experiments.rst#parallel-experiment)\n",
    "    have verified that LightGBM can achieve a linear speed-up by using\n",
    "    multiple machines for training in specific settings.\n",
    "-   **Functionality**: LightGBM offers a wide array of [tunable\n",
    "    parameters](https://github.com/Microsoft/LightGBM/blob/master/docs/Parameters.rst),\n",
    "    that one can use to customize their decision tree system. LightGBM on\n",
    "    Spark also supports new types of problems such as quantile regression.\n",
    "-   **Cross platform**: LightGBM on Spark is available on Spark, PySpark, and SparklyR.\n",
    "\n",
    "### LightGBM Usage\n",
    "\n",
    "- **LightGBMClassifier**: used for building classification models. For example, to predict whether a company bankrupts or not, we could build a binary classification model with `LightGBMClassifier`.\n",
    "- **LightGBMRegressor**: used for building regression models. For example, to predict housing price, we could build a regression model with `LightGBMRegressor`.\n",
    "- **LightGBMRanker**: used for building ranking models. For example, to predict the relevance of website search results, we could build a ranking model with `LightGBMRanker`."
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use `LightGBMClassifier` to train a classification model\n",
    "\n",
    "In this example, we use LightGBM to build a classification model in order to predict bankruptcy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.core.platform import *"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = (\n",
    "    spark.read.format(\"csv\")\n",
    "    .option(\"header\", True)\n",
    "    .option(\"inferSchema\", True)\n",
    "    .load(\n",
    "        \"wasbs://publicwasb@mmlspark.blob.core.windows.net/company_bankruptcy_prediction_data.csv\"\n",
    "    )\n",
    ")\n",
    "# print dataset size\n",
    "print(\"records read: \" + str(df.count()))\n",
    "print(\"Schema: \")\n",
    "df.printSchema()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split the dataset into train and test sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train, test = df.randomSplit([0.85, 0.15], seed=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Add a featurizer to convert features into vectors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pyspark.ml.feature import VectorAssembler\n",
    "\n",
    "feature_cols = df.columns[1:]\n",
    "featurizer = VectorAssembler(inputCols=feature_cols, outputCol=\"features\")\n",
    "train_data = featurizer.transform(train)[\"Bankrupt?\", \"features\"]\n",
    "test_data = featurizer.transform(test)[\"Bankrupt?\", \"features\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Check if the data is unbalanced"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "display(train_data.groupBy(\"Bankrupt?\").count())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Model Training"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.lightgbm import LightGBMClassifier\n",
    "\n",
    "model = LightGBMClassifier(\n",
    "    objective=\"binary\", featuresCol=\"features\", labelCol=\"Bankrupt?\", isUnbalance=True\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = model.fit(train_data)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {
    "tags": [
     "hide-synapse-internal"
    ]
   },
   "source": [
    "\"saveNativeModel\" allows you to extract the underlying lightGBM model for fast deployment after you train on Spark."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": [
     "hide-synapse-internal"
    ]
   },
   "outputs": [],
   "source": [
    "from synapse.ml.lightgbm import LightGBMClassificationModel\n",
    "\n",
    "if running_on_synapse():\n",
    "    model.saveNativeModel(\"/models/lgbmclassifier.model\")\n",
    "    model = LightGBMClassificationModel.loadNativeModelFromFile(\n",
    "        \"/models/lgbmclassifier.model\"\n",
    "    )\n",
    "if running_on_synapse_internal():\n",
    "    model.saveNativeModel(\"Files/models/lgbmclassifier.model\")\n",
    "    model = LightGBMClassificationModel.loadNativeModelFromFile(\n",
    "        \"Files/models/lgbmclassifier.model\"\n",
    "    )\n",
    "else:\n",
    "    model.saveNativeModel(\"/tmp/lgbmclassifier.model\")\n",
    "    model = LightGBMClassificationModel.loadNativeModelFromFile(\n",
    "        \"/tmp/lgbmclassifier.model\"\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Visualize feature importance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "feature_importances = model.getFeatureImportances()\n",
    "fi = pd.Series(feature_importances, index=feature_cols)\n",
    "fi = fi.sort_values(ascending=True)\n",
    "f_index = fi.index\n",
    "f_values = fi.values\n",
    "\n",
    "# print feature importances\n",
    "print(\"f_index:\", f_index)\n",
    "print(\"f_values:\", f_values)\n",
    "\n",
    "# plot\n",
    "x_index = list(range(len(fi)))\n",
    "x_index = [x / len(fi) for x in x_index]\n",
    "plt.rcParams[\"figure.figsize\"] = (20, 20)\n",
    "plt.barh(\n",
    "    x_index, f_values, height=0.028, align=\"center\", color=\"tan\", tick_label=f_index\n",
    ")\n",
    "plt.xlabel(\"importances\")\n",
    "plt.ylabel(\"features\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generate predictions with the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions = model.transform(test_data)\n",
    "predictions.limit(10).toPandas()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.train import ComputeModelStatistics\n",
    "\n",
    "metrics = ComputeModelStatistics(\n",
    "    evaluationMetric=\"classification\",\n",
    "    labelCol=\"Bankrupt?\",\n",
    "    scoredLabelsCol=\"prediction\",\n",
    ").transform(predictions)\n",
    "display(metrics)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use `LightGBMRegressor` to train a quantile regression model\n",
    "\n",
    "In this example, we show how to use LightGBM to build a regression model."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "triazines = spark.read.format(\"libsvm\").load(\n",
    "    \"wasbs://publicwasb@mmlspark.blob.core.windows.net/triazines.scale.svmlight\"\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# print some basic info\n",
    "print(\"records read: \" + str(triazines.count()))\n",
    "print(\"Schema: \")\n",
    "triazines.printSchema()\n",
    "display(triazines.limit(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Split dataset into train and test sets"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train, test = triazines.randomSplit([0.85, 0.15], seed=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train the model using `LightGBMRegressor`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.lightgbm import LightGBMRegressor\n",
    "\n",
    "model = LightGBMRegressor(\n",
    "    objective=\"quantile\", alpha=0.2, learningRate=0.3, numLeaves=31\n",
    ").fit(train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(model.getFeatureImportances())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generate predictions with the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "scoredData = model.transform(test)\n",
    "display(scoredData)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.train import ComputeModelStatistics\n",
    "\n",
    "metrics = ComputeModelStatistics(\n",
    "    evaluationMetric=\"regression\", labelCol=\"label\", scoresCol=\"prediction\"\n",
    ").transform(scoredData)\n",
    "display(metrics)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Use `LightGBMRanker` to train a ranking model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Read the dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df = spark.read.format(\"parquet\").load(\n",
    "    \"wasbs://publicwasb@mmlspark.blob.core.windows.net/lightGBMRanker_train.parquet\"\n",
    ")\n",
    "# print some basic info\n",
    "print(\"records read: \" + str(df.count()))\n",
    "print(\"Schema: \")\n",
    "df.printSchema()\n",
    "display(df.limit(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Train the ranking model using `LightGBMRanker`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from synapse.ml.lightgbm import LightGBMRanker\n",
    "\n",
    "features_col = \"features\"\n",
    "query_col = \"query\"\n",
    "label_col = \"labels\"\n",
    "lgbm_ranker = LightGBMRanker(\n",
    "    labelCol=label_col,\n",
    "    featuresCol=features_col,\n",
    "    groupCol=query_col,\n",
    "    predictionCol=\"preds\",\n",
    "    leafPredictionCol=\"leafPreds\",\n",
    "    featuresShapCol=\"importances\",\n",
    "    repartitionByGroupingColumn=True,\n",
    "    numLeaves=32,\n",
    "    numIterations=200,\n",
    "    evalAt=[1, 3, 5],\n",
    "    metric=\"ndcg\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lgbm_ranker_model = lgbm_ranker.fit(df)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Generate predictions with the model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "dt = spark.read.format(\"parquet\").load(\n",
    "    \"wasbs://publicwasb@mmlspark.blob.core.windows.net/lightGBMRanker_test.parquet\"\n",
    ")\n",
    "predictions = lgbm_ranker_model.transform(dt)\n",
    "predictions.limit(10).toPandas()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3.8.5 64-bit (conda)",
   "name": "python385jvsc74a57bd072be13fef265c65d19cf428fd1b09dd31615eed186d1dccdebb6e555960506ee"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "orig_nbformat": 2
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
